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Abstract Understanding design principles of biomolecular recognition is a 
key question of molecular biology. Yet the enormous complexity and diver- 
sity of biological molecules hamper the efforts to gain a predictive ability for 
the free energy of protein-protein, protein-DNA, and protein-RNA binding. 
Here, using a variant of the Derrida model, we predict that for a large class 
of biomolecular interactions, it is possible to accurately estimate the relative 
free energy of binding based on the fluctuation properties of their energy 
spectra, even if a finite number of the energy levels is known. We show that 
the free energy of the system possessing a wider binding energy spectrum 
is almost surely lower compared with the system possessing a narrower en- 
ergy spectrum. Our predictions imply that low-affinity binding scores, usually 
wasted in protein-protein and protein-DNA docking algorithms, can be ef- 
ficiently utilized to compute the free energy. Using the results of Rosetta 
docking simulations of protein-protein interactions from Andre et al., Proc. 
Natl. Acad. Sci. U.S.A. 105, 16148 (2008), we demonstrate the power of our 
predictions. 
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1 Introduction 

Recent high-throughput experiments demonstrate a high level of multi-specific 
and non-specific binding in protein-protein [T] , protein-DNA [5] , and protein- 
RNA [3] interactions in a living cell. These observations challenge a conven- 
tional approach of molecular biology usually focusing on just a single pathway 
or function, or a single binding partner for a protein. This suggests that in or- 
der to predict correctly the properties of molecular interaction networks, one 
needs to take into account the effect of multiple binding, essentially comput- 
ing the free energy of the system rather than the energy of individual states. 
The latter statement is quite intuitive as any protein in a cell interacts with 
thousands of proteins (or DNA binding cites) , and even if one (or few) of its 
interaction partners have stronger binding affinity than others, still weaker 
interactions are not negligible and they may become even dominant. Yet the 
complexity of biological molecules and a lack of knowledge of accurate inter- 
molecular interaction potentials hamper computational efforts to predict the 
free energies of protein-protein, protein-DNA, and protein-RNA binding. A 
key question is how to estimate the binding free energy based on the par- 
tial knowledge of the binding energy spectrum. Each energy in the binding 
energy spectrum is defined here as the inter-molecular interaction energy of 
a particular bound state of interacting molecules (e.g., particular binding 
configuration of protein-protein or protein-DNA complex) . 

It was recently shown that global symmetry properties of proteins, both 
on structural and sequence levels, generically define the properties of their 
binding energy spectrum {4,5,6,7,8,9 . In particular, it was shown analyti- 
cally in |3] that the probability distribution for the interaction energies of ho- 
modimers, P(E), is always wider as compared to heterodimers, Ohomo/chetero = 
y/2, where a is the dispersion of P(E). This statistical law was also confirmed 
computationally, using one of the most advanced methods for computing 
protein-protein interactions applied to a large dataset of protein complexes 
from the protein data bank (PDB) [6]. It was also predicted that proteins 
possessing a higher level of structural correlations (clustering) of amino acids 
in their interfaces, demonstrate a wider binding energy spectrum, as well [7]. 
It was shown recently that protein sequences with enhanced strength of di- 
agonal correlations of amino acid positions demonstrate a similar property 
[8j[9] . Intuitively it means that the clustering of amino acids of the same type 
statistically enhances the dispersion of the binding energy spectrum [8,9J. 
Sequences with a higher level of such clustering will possess a larger disper- 
sion than sequences with a lower level of clustering [8,9 . We have recently 
analyzed the properties of the energy spectrum of nonspecific protein-DNA 
binding |10j . Similar to the case of protein- protein interactions, we also ob- 
served that the width of the protein-DNA binding energy spectrum depends 
on the correlation properties of DNA, such as the symmetry and the length- 
scale of DNA sequence correlations [10] . 

We emphasize that in all of those examples the average interaction ener- 
gies of the compared spectra are always identical, and only the dispersions of 
the energy spectra are different, Fig. [T] The predicted effects are thus essen- 
tially governed by the fluctuations of energy, and go beyond the mean field. 
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The case where the average energies are not equal is also discussed below. 
We assume here that the probability distribution, P(E), is Gaussian. This 
is an accurate assertion since, practically, the binding energy, E, is a sum 
of thousands of binary inter-atomic interactions, and this sum is normally 
distributed according the the central limit theorem [T] . 

Here, we estimate the relative free energy of two interacting systems char- 
acterized by the same average binding energies, (Ex) = (E2), but different 
dispersions, o\ > cr 2 , Fig. [IK. We show that the free energy, F, of the system 
possessing a wider binding energy spectrum, is always shifted towards lower 
free energies compared to the system possessing a narrower P(E), even if a 
finite number of energy levels is known, Fig. [TJ\_. In particular, we show that 
the partition function, Z±, is almost surely larger than Z 2 , with the proba- 
bility, P{Z\ > Z 2 ) > 1 — C/M, where C{a\ 1 U2) is a finite constant, and M 
is the number of the energy levels used to compute the partition function. 

We note that in his seminal work [TT] , Derrida has established that in the 
random energy model, where the energy spectrum of the system, P(E), is 
Gaussian, and the partition function, Z = J2iLi exp(— E^/ksT), for each 
realization of P(E) with M energy states, , the quenched average of the 
free energy, (F) q — — ksT (hi Z), is equal to the annealed average, (F) = 
— fc^Tln (Z), in the thermodynamic limit of large M: 

<F) = - fcB TlnM-^, (1) 

if the temperature T is above some critical temperature, T > T c , where 
ksT c ~ a/VhiM [TT], and a is the standard deviation of P(E). Using an 
example from the Rosetta docking simulations of protein-protein interactions, 
we show below that Eq. provides an accurate estimate for the relative 
free energy, when M reaches only few thousands. 

We stress that our model is applicable to interacting systems without a 
pronounced low-energy (ground) state in their energy spectra. The existence 
of such a ground state corresponds to a strong, specific binding. On the 
contrary, a large class of weakly interacting biomolecules in a living cell, 
such as nonspecific protein-protein, protein-DNA, or protein-RNA binding, 
represents the systems where our model is operational. Such relatively weak, 
nonspecific interactions, often called "promiscuous interactions" , have been 
shown to play an important role in different cellular processes, and in many 
cases, the effect of such weak interactions becomes the dominant factor in a 
living cell [12]. 



2 Results 

We consider the ensemble of the interaction energies, {E^}, of two interact- 
ing biomolecules, where each energy, £«, corresponds to a given conforma- 
tion (i.e., a given interaction state), i, of these molecules. 

We begin with the definition of the free energy of the system, F = — In Z, 
where the partition function, Z = YltLi e~ s< ' , and we assume for simplicity 
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Fig. 1 Calculation of the free energy and fluctuations of the free energy from the 
energy spectrum. A. Example: Gaussian probability distributions for the interac- 
tion energy, E, characterized by the identical average energies, (Ei) = {E2), and 

different dispersions, <J\/(T2 = V%- E is represented in the units of ksT. B. Com- 
puted average free energy differences, {AF) = (.F(<7i) —Ffaz)), as a function of 
the ratio of dispersions. Circles with error bars represent the simulation results, 
where the quenched averaging is performed (see the text). We used M = fOOO for 
each computation of the partition function, and the averaging is performed with 
respect to 200 realizations. {AF) is represented in the units of UbT. Error bars 
represent free energy fluctuations, and show two standard deviations. Solid curve 
represents the analytical result, Eq. Q. 

that ksT — 1, and the energy, E, is represented in the units of fc^T; here 
ks is the Boltzmann constant, and T is the absolute temperature. We also 
assume that the energy, E, obeys the Gaussian distribution, P(E), with zero 
mean, {E) = 0, and standard deviation, a. The set of M energy values, E^> , 
is obtained as a statistical realization of P(E). In what follows we compare 
the statistical properties of Z computed based on the realizations drawn from 
two distributions with different values of standard deviation, o\ > 02, Fig. 

ER. 

Since Y — e~ E is a lognormal random variable, it is well-known |14j and 
can be readily verified that the expectation, (Y), and the variance, VAR(Y), 
of Y are given by (Y) = e"' V 2 and VAR(Y) = e 2ff2 - . We note that Eq. 
simply follows from (F) = - In (Af (Y)). 

For a large number M, let Y^\ Y^ 2 \ . . . , Y^ M ' be independent random 
variables, so that each of them is distributed identically to Y. Since Z = 
5Zi=i^*\ by linearity of the expectation, (Z) = M ■ e a / 2 . Also, since 
YW, Y {2 \ Y( M ) are independent, it follows that 

M M 

VAR(Z) = VARQ^yM) = VAR(yW) = M ■ (e 2a2 - e"*) , (2) 

i=l i=l 

and the standard deviation of Z, a{Z) = yf VAR{Z). 

Consider now two normal independent random variables E\ and E% , both 
having zero mean. We assume further that the standard deviation o\ of E\ 
is greater than the standard deviation o"2 of E2, i.e., cr% > 0% > 0. Let 
Yi = e~ El and Y2 = eT E ^ be the corresponding lognormal random variables. 
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Next we show that asymptotically almost surely it holds that Z\ > Z 2 , where 
Lemma 1 Let <7i,<72, <J\ > a% > 0, be two positive constants. Then 



P(Z!>Z 2 ) > l--L.f( ai ,a 3 ) , (3) 

where 

/(ai ' CT2) = 4 _ e *!/2 )2 

is a positive constant that depends only on u\ and o 2 . 

Proof Chebyshev's inequality (see, e.g., [13], p. 43) states that for any random 
variable X with expectation (X) and standard deviation ax, and any b > 0, 

P{\X-{X)\>b-a x ) < 1. (4) 



By the preceding argument (see Eq. (12j)), for i = 1,2, 



(Zt) = M-e^' 2 , a(Zi) = Vm • Ve 2 ^ - e ff . 2 

Let 

= (gi) + (Z 2 ) = e^/ 2 +e g i/ 2 
2 2 



Hence 



(Z 1 )-A = A-(Z 2 ) = iflLJ^Z = M • (5) 



Denote Q = — — — and D = M ■ Q. Observe that since o\ > a 2 , both 
D and Q are positive. Consequently, (Zi) — D = (Z 2 ) + D = A. Also, by 
Chebyshev's inequality (see Eq. Q), 



P{\Z X (Z X ) \>D) = P{\Z X (Z,) \>VM- , ® , • a(Z 1 )) 



\fe 2a i - e a i 

1 e 2<T i - e CT i 



< 



M Q 2 



It follows that 



P{Zi <A) = P(Z 1 < {Zt) -D) = P(({Z X ) - Zi) > D) 

n 2 2 

<P(\(Z 1 )-Z 1 \>D) < 



M Q 2 
Analogously, 

P(Z 2 > A) = P((Z 2 -(Z 2 )) > D) < P(\Z 2 -(Z 2 ) \>D)< 



M Q 2 
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Hence, by union bound, 

P{{Z X > A) and (Z 2 < A)) > 1 - ~ • ^ • ((e 2 ^ - e*?) + {e^ - S)) . 
Finally, 

P{Zi > Z 2 ) > P{(Z 1 > A) and (Z 2 < A)) 

- 1 ~ if ■ h ' ((g2a? _ e<Tl) + {e2al ~ e ° l)) ■ (6) 

Since the right-hand side of the inequality (Eq. (|3|) tends to 1 as M 
grows, it follows that the event Z\ > Z 2 occurs asymptotically almost surely. 
This argument generalizes directly to the scenario when Zt = J2tL\ and 

Z 2 = J2i=i ^2^1 where M\ and M 2 are (not necessarily equal) large integers. 
The generalized inequality is 



It is easy to understand the obtained results intuitively. In the calculation of 
the partition function, Z = X)f=i e ~ B< ' , M energies are drawn from the 
Gaussian distribution. However, only a subset of lowest energies provides the 
dominant contribution to Z. The contribution from high energies is small, 
e,-\ E \ <g; i. Since this dominant subset is localized in the low energy tail, the 
distribution with a larger standard deviation will obviously deliver the larger 
partition function. 

The major practical implication of our result is the ability to estimate 
the relative free energy of biomolecular interactions without performing the 
actual calculations of the free energy. We establish that a simple, direct 
relationship between the average free energy difference and the standard 
deviations of the energy spectra, Eq. ([I]) , 

(AF) = (F(o~i) - F(a 2 )) = (7) 

is accurate for a system where only a finite number of energy levels is known. 
We note that our analytical definition of the average free energy relies on 
the annealed definition of the average, (F) = — hx{Z). In systems without 
frustration the latter definition is known to be in excellent agreement with 
the quenched averaging, (F) = — (In Z) , unlike the case of highly frustrated 
systems such as spin glasses [TT] or proteins below the glass transition tem- 
perature [15] . Indeed, the quenched averaging performed numerically is in 
excellent agreement with the analytical result, Fig.[lj3. The error bars in this 
plot represent the magnitude of the free energy fluctuations. Yet, our cen- 
tral result in this paper is stronger than the statement described by Eq. Q . 
Here we predict for two systems, that even if a single calculation of the free 



PiZ^Mi > Z 2 /M 2 ) > 1 



(e CT ?/ 2 
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Fig. 2 Snapshot of assymetric (we use the term "heterodimeric" to describe such 
symmetry) (A), and symmetric (we use the term "homodimeric" to describe such 
symmetry) (B) binding modes from Rosetta docking simulations of protein L (PDB 
code: lhz6). The structures represent the lowest energy binding modes (using the 
same energy term as in ref. [6], the interchain pair potential in the Rosetta low 
resolution docking energy function) from an assymetric or symmetric docking sim- 
ulation of protein L dimers. Position of centroid atoms, representing the sidechain 
atoms, is shown as spheres. 

energy is performed for each system, using a single realization of the proba- 
bility distributions, P{Ei) and P(E 2 ), and it is known that <j\ > a 2 , then we 
guarantee that F± < F 2 with the probability approaching one, provided that 
the number of measured energy levels, M, in each realization is sufficiently 
large. Finally we note that if one of the distributions, P(Ei), is shifted from 
zero mean by the energy, (E±) = E , the free energy, Eq. ^ gets trivially 
shifted exactly by this magnitude, (AF) = E n — (of — <J 2 )/2. This is because 
the fluctuation contribution to the free energy difference depends exclusively 
on the widths of the corresponding energy spectra. 

3 Example: Free energy of nonspecific protein-protein interaction 

We now apply our results to the calculation of the free energy of nonspe- 
cific protein-protein interactions. We use an example from Andre et al. [5J, 
where the Rosetta docking simulations of self-interacting protein L in ho- 
modimeric and heterodimeric conformations were performed [HJ (see Fig. [2]). 
In particular, these simulations provide the interaction energies of ~ 30, 000 
homodimeric and ~ 22,000 heterodimeric conformations, respectively Fig. 
[3]A Each of these conformations is chosen randomly, without any energy op- 
timization. Based on these energies, we computed the free energy difference 
AF = Fhomo — Fhetero, as a function of the energy sample size, M, where, 
Fhomo = J2iLi ex P(- E hlmo)^ and analogously for F hetero (see Fig. bp). The 
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Fig. 3 Example: Calculation of the free energy of nonspecific self-binding for pro- 
tein L, using the Rosetta docking scores obtained by Andre et al. in ref. [B]. A. 
Computed probability distributions, P(E), of the Rosetta docking energies for pro- 
tein L in symmetric (i.e. homodimeric, shown in black) and nonsymmetric (i.e. 
heterodimeric, shown in gray) conformations [BJ- Snapshots from these Rosetta 
docking simulations are shown in Fig. [2] The interaction energy, E, is in dimension- 
less Rosetta score units. There are overall 29,976 homodimeric, {E ( h 2 m J, and 22,038 

heterodimeric, {fiEg.,}, conformations sampled, respectively. B. Computed free 
energy difference between homodimers and heterodimers, AF = Fh omo — Fhetero, 

as a function of the energy sample size, M, where, Fhomo = YlrLi ex P(~-^itL,,Ji 
and analogously for Fh etero . The horizontal line represents the expectation value, 
{AF} = —(al omo — al etero )/2, Eq. where a homo and cr hetero are the standard 
deviations of the corresponding energy spectra. Inset represents the relative devi- 
ation of AF from the expectation value, 5 = \AF - (AF) \/\AF + (AF) , as a 
function of M. 



key result here is that the free energy of homodimeric conformations is always 
lower then the corresponding free energy of heterodimeric conformations, Fig. 
[3j3. After the sample size, M, reaches only few thousands conformations, the 
free energy difference reaches its expectation value, (AF), with the accuracy 
reaching 90-95% (see inset in Fig. [3j3). The estimate performed above, Eq. 
(§, gives: P(Z homo > Z hetero ) > 0.95, if M = 5000, where Z h 

omo and ^/ieiero 

are the corresponding partition functions, each obtained based on M energy 
values. We suggest therefore that our method should provide an efficient way 
to estimate the free energies of nonspecific biomolecular binding. 
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4 Conclusion 

The majority of macromolecular docking algorithms rejects the lower-affinity 
binding scores and retain only one or few lowest energy conformations. We 
suggested here a simple method based on Derrida-type random energy model 
[IT] , how those wasted scores can be used in order to estimate the free en- 
ergy of binding. Our conclusions can be applicable to different biomolecular 
systems, such as protein-protein, protein-RNA, and protein-DNA complexes. 
The input energy spectra, P{E), may come from different configurations of 
two interacting biomolecules, or they can come from a single biomolecule 
interacting with a set of partner binders. It is important to note that our 
conclusions hold true even when the probability distribution, P(Ei) with a 
larger standard deviation than P(E%), o\ > cr 2 , is sampled by a smaller num- 
ber of states, Mi < M 2 ! The free energy F\ will be reduced compared with 
F2 in the latter case due to the fact that the dominant contribution to the 
partition function comes from the lower-energy tails of P(E\) and P(E2), 
and thus a wider energy spectrum will always deliver a lower free energy. 

In conclusion, we stress that our model is applicable to weakly interact- 
ing systems, without a pronounced energy minima in the interaction energy 
spectrum. We use the term "nonspecific binding" or "promiscuous binding" 
to describe such systems. Nonspecific binding is widespread in a living cell. 
In practice, the majority of the interactions are actually nonspecific. Tradi- 
tionally, such nonspecific interactions are neglected, which leads to significant 
inaccuracies in the computation of the free energy of the system. Here, we 
suggested a possible method to estimate the relative free energies of nonspe- 
cific binding. 

We thank A. Afek, D. Andelman, D. Frenkel, O. Furman (Schueler), W. Gel- 
bart, and E. I. Shakhnovich for helpful discussions. D. B. L. acknowledges 
the financial support from the Israel Science Foundation grant 1014/09. 
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